The value of the likelihood is occasionally used by high energy physicists as a statistic to measure
goodness-of-fit in unbinned maximum likelihood fits. Simple examples are presented that illustrate why
this (seemingly intuitive) method fails in practice to achieve the desired goal.



1. INTRODUCTION 

For every complex problem, there is a so- 
lution that is simple, neat, and wrong. 
H.L. Mencken 

The complex problem considered here is goodness-of-fit
(g.o.f.) for unbinned maximum likelihood fits in
cases when binned g.o.f. methods and Kolmogorov-Smirnov
are not well suited:

A physicist, having fit a complicated model to his
multidimensional data to obtain estimates of the values
of certain parameters, is also expected to check
how well the data match his model. In the sections 
that follow, we discuss a g.o.f. method, still occasion- 
ally used in high energy physics (HEP), that is simple, 
neat, and wrong. 



2. THE SNW¹ METHOD



faulty resolution: We approximate this ensemble by
replacing the true $\theta$ with the parameter estimate
$\hat\theta$ obtained from the fit to the actual data.

This method has a long history of use in high energy
physics. It's recommended by several excellent statistical
data analysis texts written by (and for) high
energy particle physicists. Consequently, and because
the method is "obvious", it's still being used in (some)
HEP analyses.

Reference 0, written by a statistician and four 
physicists, describes the method, but criticizes: 

The likelihood of the data would appear to 
be a good [g.o.f.] candidate at first sight. 
Unfortunately, this carries little informa- 
tion as a test statistic, as we shall see. . . 

Since this was ignored, maybe its warning was not 
strong enough. I have found no mention of the method 
in texts written (solely) by statisticians. 



We start with a brief description of the method. (A 
true derivation, for obvious reasons, is not available.) 

observation: Maximum likelihood fits are performed
by maximizing the likelihood $L(\theta, x)$ with
respect to the (unknown) parameters $\theta$ for fixed
data $x$.

faulty intuition: Thus, the value of the likelihood 
provides the g.o.f. between the data and the 
probability density function (p.d.f.): The value 
of the likelihood at the maximum, 



$$L(\hat\theta, x)$$



corresponds to the best fit — the smaller the like- 
lihood, the worse the g.o.f., . . . 

obstacle: To calculate this "g.o.f." P-value, we need
the distribution of $L_{\max}$ for an ensemble of
random $x$ deviates from the p.d.f. using the true
(but unknown) parameters $\theta$.



¹ Simple, Neat, Wrong.



3. A SIMPLE TEST OF THE METHOD 

Always test your general reasoning against 
simple models. John S. Bell 

Reference 0, following the above advice, tests the 
method against the p.d.f. 

$$p(t \mid \tau) = \frac{1}{\tau}\, e^{-t/\tau}$$

where $t$ (we have in mind the decay-time of a particle)
follows an exponential distribution, and $\tau$ (the
mean lifetime) is a parameter whose value, being unknown,
is estimated from data. The likelihood for $N$
observations $t_i$ is given by



$$\ln L = -N\ln\tau - \frac{1}{\tau}\sum_{i=1}^{N} t_i$$
The value ($\hat\tau$) of $\tau$ that maximizes the likelihood, and
the value ($L_{\max}$) of the likelihood at its maximum, are
given by

$$\hat\tau = \frac{1}{N}\sum_{i=1}^{N} t_i, \qquad \ln L_{\max} = -N\,(1 + \ln\hat\tau)$$

3.1. The First Surprise



The value of the likelihood at its maximum (in this
test case) is just a simple function of the m.l.e. $\hat\tau$: all
samples with the same mean obtain the same "g.o.f." value.
This is a disaster for g.o.f. Even if the true value of
$\tau$ (call it $\tau_0$) were known in advance, so that we could
calculate the P-value associated with the observed $\hat\tau$,
merely comparing the $\hat\tau$ of the data with $\tau_0$ is not
sufficient to show that the observed data are modeled
well by the exponential distribution.
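
The point can be checked numerically. The following is a minimal sketch, not taken from the original text: the function name `ln_l_max` and both data samples are hypothetical, chosen only so that the two samples share the same mean.

```python
import math

def ln_l_max(times):
    """Maximized exponential log-likelihood: ln L_max = -N (1 + ln tau_hat)."""
    n = len(times)
    tau_hat = sum(times) / n  # m.l.e. of the mean lifetime is the sample mean
    return -n * (1.0 + math.log(tau_hat))

# Two hypothetical samples with the same mean (2.0) but very different shapes.
plausible = [0.1, 0.4, 0.9, 1.6, 2.5, 6.5]  # spread out, loosely exponential-looking
absurd = [2.0] * 6                           # every "decay" at exactly t = 2

# Both lines print the same value (up to rounding): the statistic cannot tell them apart.
print(ln_l_max(plausible))
print(ln_l_max(absurd))
```

Any other sample with mean 2.0 and six events would give the same number, which is the sense in which the statistic carries no shape information here.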



3.2. The Second Surprise 

Since under this method, our P-value ensemble is
actually based on the value of $\hat\tau$ computed from the
data (not knowing the true value $\tau_0$), we always obtain
a P-value of about 50%, for any data whatsoever. This
is a second disaster for g.o.f. By construction, the
distribution of $L_{\max}$ from our ensemble of $N$-event
pseudo-experiments tracks the $L_{\max}$ observed from
the data.

The fact that the method yields "reasonable" P-values
has undoubtedly contributed to its longevity
in practice: P-values very near 0% or 100% would have
triggered further investigation.
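
A small simulation illustrates the "about 50%" behavior. This is a sketch under stated assumptions, not the procedure of any particular analysis: the helper names (`ln_l_max`, `snw_p_value`), the sample sizes, the seeds, and the three test samples are all hypothetical. The pseudo-experiments are generated with $\tau$ set to the fitted $\hat\tau$, as the method prescribes.

```python
import math
import random

def ln_l_max(times):
    """Maximized exponential log-likelihood, -N (1 + ln tau_hat)."""
    n = len(times)
    return -n * (1.0 + math.log(sum(times) / n))

def snw_p_value(data, n_pseudo=2000, seed=1):
    """Fraction of pseudo-experiments (generated with tau = tau_hat of the data)
    whose L_max is no larger than the L_max observed in the data."""
    rng = random.Random(seed)
    n = len(data)
    tau_hat = sum(data) / n
    observed = ln_l_max(data)
    worse = 0
    for _ in range(n_pseudo):
        pseudo = [rng.expovariate(1.0 / tau_hat) for _ in range(n)]
        if ln_l_max(pseudo) <= observed:
            worse += 1
    return worse / n_pseudo

rng = random.Random(0)
samples = {
    "exponential": [rng.expovariate(1.0) for _ in range(100)],
    "uniform": [rng.uniform(0.5, 1.5) for _ in range(100)],  # clearly not exponential
    "constant": [3.0] * 100,                                  # absurd data
}
p_values = {name: snw_p_value(data) for name, data in samples.items()}
print(p_values)
```

All three samples, including the two that look nothing like an exponential, land near the middle of the P-value range.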



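With the substitution $t = x^2$ in our exponential example,
the p.d.f. in $x$ gains the Jacobian factor $2x$, and the
g.o.f. statistic is now calculated as

$$-\ln L_{\max} = N\,(1 + \ln\hat\tau) - \sum_{i=1}^{N}\ln(2x_i) = N\,(1 + \ln\hat\tau) - \sum_{i=1}^{N}\ln(2\sqrt{t_i})$$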



That is, the "g.o.f." statistic is not invariant under 
change of variable in the continuous p.d.f. case. (The 
value of the m.l.e. is, of course, invariant.) 

Under change of variable, the "g.o.f." statistic picks 
up an extra term from the Jacobian — an extra func- 
tion of the data. We're free to choose any transfor- 
mation, so we can make the "g.o.f." statistic more or 
less anything at all — a serious pathology. 

At this point, experts point out that ratios of likelihoods
have the desired invariance under change of
variable, but, while the likelihood ratio is a useful test
statistic in certain special cases, it is not at all clear
how to obtain a useful g.o.f. statistic from the likelihood
ratio in the general, unbinned, case.
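
The non-invariance can be seen with two hypothetical samples of equal mean. In the sketch below (function names and data are illustrative assumptions), the statistic agrees for the two samples when computed in $t$, but disagrees once the same data are expressed in $x = \sqrt{t}$, because of the Jacobian term.

```python
import math

def neg_ln_l_max_t(times):
    """-ln L_max in the variable t: N (1 + ln tau_hat)."""
    n = len(times)
    tau_hat = sum(times) / n
    return n * (1.0 + math.log(tau_hat))

def neg_ln_l_max_x(times):
    """-ln L_max for the same data in x = sqrt(t), where the p.d.f. is
    (2x / tau) exp(-x^2 / tau); the m.l.e. tau_hat itself is unchanged."""
    jacobian_term = sum(math.log(2.0 * math.sqrt(t)) for t in times)
    return neg_ln_l_max_t(times) - jacobian_term

sample_a = [0.1, 0.4, 0.9, 1.6, 2.5, 6.5]  # hypothetical data, mean 2.0
sample_b = [2.0] * 6                        # hypothetical data, mean 2.0

print(neg_ln_l_max_t(sample_a), neg_ln_l_max_t(sample_b))  # equal in t
print(neg_ln_l_max_x(sample_a), neg_ln_l_max_x(sample_b))  # different in x
```

Which of the two samples looks "better" therefore depends on an arbitrary choice of variable.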



3.3. Lessons Learned 

In this example, g.o.f. is equivalent to testing the
single hypothesis: "The data are from an exponential
distribution of unspecified mean." $L_{\max}$ provided no
information with respect to this hypothesis.

What went wrong? In our test case, the likelihood
could be expressed as a function of just the parameter
and its maximum likelihood estimator (m.l.e.):
$L(\tau; \hat\tau)$. All data samples with the same m.l.e. gave
the same "g.o.f."

Exactly the same thing happens in the Gaussian
(normal) case: the likelihood can be written using
solely the 2 parameters and their estimators:
$L(\mu, \sigma; \hat\mu, \hat\sigma)$.

Other "textbook" distributions — scaled gamma, 
beta, log-normal, geometric — also fail in the same way. 
Geometric is a discrete distribution, so the problem is 
not restricted to the continuous case. 



4. MORE TROUBLE: NON INVARIANCE 

Returning to our exponential example, suppose we
make the substitution $t = x^2$. The p.d.f. transforms
as

$$\frac{1}{\tau}\, e^{-t/\tau}\, dt = \frac{2x}{\tau}\, e^{-x^2/\tau}\, dx$$



5. A REPLACEMENT MODEL 

Since we now lack an intuitive understanding, we 
need a replacement intuition for what is going on. I 
propose this model: 

Denote by $H_0$ the hypothesis that the data are from
the p.d.f. in question. Specify an alternative hypothesis
$H_1$ that the data are from a uniform p.d.f. (flat
in the variables that we happen to have chosen). At
least, the $H_1$ p.d.f. is flat over the region where we
have data; outside that region it can be cut off.

Performing a classic Neyman-Pearson hypothesis
test of $H_0$ vs $H_1$, we use the ratio of their likelihoods
as our test statistic:

$$\lambda(x) = \frac{L(x \mid H_0)}{L(x \mid H_1)} = \frac{L(x)}{\text{constant}}$$

So, the "g.o.f." statistic can be re-interpreted as suitable
for a hypothesis test that indicates which of $H_0$
(our p.d.f.) and $H_1$ (a flat p.d.f.) is more favored by
the data, a well established statistical practice.

The benefit of the new interpretation is that it ex- 
plains behaviors that were baffling under the g.o.f. in- 
terpretation: Neyman-Pearson hypothesis tests and 
g.o.f. tests behave quite differently. 

For example, a reasonable g.o.f. statistic should
be at least approximately distribution independent,
but $\lambda(x)$ is often highly correlated with the m.l.e.'s
(100% in our exponential case). This high correlation
was confirmed in the example contributed by
K. Kinoshita [3] to the 2002 Durham Conference. Not
knowing the true value of the parameters then makes
it difficult, or impossible, to use $\lambda(x)$ as g.o.f., since
we don't know what $\lambda(x)$ should be.² The behavior
of these correlations is natural and obvious in
the hypothesis test picture: changing the parameters
changes the "flatness" of the $H_0$ p.d.f., and $\lambda(x)$
reflects this.

Reference [1] pointed out that, with no unknown parameters,
one can always transform the p.d.f. to a flat
distribution. Then $\lambda(x)$ becomes constant, independent
of the data: bad news for g.o.f. In the hypothesis
test picture, this becomes a comparison between
two identical hypotheses, and the result is what we
would expect.
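
As a worked instance of such a transformation (a sketch, taking the fully specified exponential $e^{-t}$ as an assumed example; the symbol $u$ is introduced here only for illustration), the cumulative distribution maps the data to a flat p.d.f., and the likelihood of any data set becomes 1:

```latex
u = F(t) = 1 - e^{-t}, \qquad
p(u) = p(t)\left|\frac{dt}{du}\right| = e^{-t}\cdot\frac{1}{e^{-t}} = 1 \quad (0 < u < 1), \qquad
L(u_1,\dots,u_N) = \prod_{i=1}^{N} 1 = 1 .
```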



6. TEST BIAS 

Take the $H_0$ p.d.f. to be

$$e^{-t} \qquad (t > 0)$$

This distribution is fully specified, with no unknown
parameters. Our "g.o.f." statistic is then

$$-\ln L = N\bar t$$

whose mean is $\langle -\ln L \rangle = N$, and variance is
$\mathrm{Var}(-\ln L) = N$, for an ensemble of data sets from
the $H_0$ p.d.f. A data set with $\bar t$ close enough to 1 will
be claimed to be a good fit to the $H_0$ p.d.f.

But say, unknown to us, the data are really from a
triangular p.d.f.:

$$1 - |t - 1| \qquad (0 < t < 2)$$

The mean and variance of $N\bar t$ will be $N$ and $N/6$
respectively, for data from the triangular distribution.
So, although the exponential and triangular p.d.f.'s
are quite different, the triangular data will be more
likely to pass the g.o.f. test than the exponential data for
which it was intended. Statisticians refer to this
situation as a case of "test bias".
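
The bias can be illustrated by simulation. The sketch below makes several assumptions that are not in the text: the acceptance rule (the statistic within 1.96 standard deviations of $N$, using the $H_0$ variance $N$), the choice of 50 events per data set, the number of trials, and the name `pass_fraction` are all hypothetical.

```python
import math
import random

def pass_fraction(draw, n_events=50, n_trials=4000, seed=2):
    """Fraction of data sets whose -ln L = sum(t_i) lies within 1.96 standard
    deviations of N, with the standard deviation sqrt(N) taken from H0."""
    rng = random.Random(seed)
    half_width = 1.96 * math.sqrt(n_events)
    passed = 0
    for _ in range(n_trials):
        neg_ln_l = sum(draw(rng) for _ in range(n_events))
        if abs(neg_ln_l - n_events) < half_width:
            passed += 1
    return passed / n_trials

# Data from the H0 exponential, and from the "impostor" triangular p.d.f. on (0, 2).
exp_rate = pass_fraction(lambda rng: rng.expovariate(1.0))
tri_rate = pass_fraction(lambda rng: rng.triangular(0.0, 2.0, 1.0))
print(exp_rate, tri_rate)
```

Under these assumptions the triangular data pass essentially always, while the exponential data, for which the test was built, fail a few percent of the time.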

We conclude that, even with no free parameters, the
"g.o.f." test is biased: there exist "impostor" p.d.f.'s
that should produce bad fits, but instead pass the
"g.o.f." test with greater probability than the p.d.f.



² Small correlations are not fatal. For example, if the P-value
of g.o.f. for the observed data in a particular case ranged only
between, say, 20% and 30%, for different true values within $\pm 3\sigma$
of the estimated value of a parameter, one would be justified
in concluding "good fit" (assuming the g.o.f. statistic used had
the right properties in other respects).



for which the test was designed. Reference |J| gives 
additional examples of this behavior. 

From the hypothesis test point of view, this behavior
makes sense. The exponential and triangular data
have the same "distance" from the flat distribution, on
the average, with the triangular data being less
susceptible to fluctuations. The hypothesis test doesn't
tell us when the data are inconsistent with both $H_0$
and $H_1$.



7. ANOTHER EXAMPLE 

Here we try to find an example p.d.f. (with a free 
parameter) that the method in question can handle 
well. We use the insight provided by the hypothesis
test picture. We want to keep the correlation between
the free parameter and the g.o.f. statistic $L_{\max}$
to a minimum. In the hypothesis test picture, this is
achieved when the "flatness" of the p.d.f. is indepen- 
dent of the parameter. A location parameter has this 
property. Additionally, we want the p.d.f. to be eas- 
ily distinguishable from a flat p.d.f. So we choose the 
Gaussian 



$$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-0.5\,(x-\mu)^2/\sigma^2}$$

where $\mu$ is unknown, but $\sigma$ is specified in advance.
The likelihood is given by



$$-\ln L = \sum_{i=1}^{N}\left[\ln\sqrt{2\pi} + \ln\sigma + \frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]$$



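When $\mu$ and $\sigma$ are both unknown, their m.l.e.'s are

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat\mu)^2$$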



Using these expressions, we can rewrite the likelihood
in the form $L(\mu, \sigma; \hat\mu, \hat\sigma)$:

$$-\ln L = \frac{N}{2}\left[\ln(2\pi) + \ln(\sigma^2) + \frac{\hat\sigma^2 + (\hat\mu - \mu)^2}{\sigma^2}\right]$$



When only $\mu$ is unspecified, its m.l.e. is $\hat\mu$ as above,
and the value of the maximized likelihood is

$$-\ln L_{\max} = \frac{N}{2}\left[\ln(2\pi) + \ln(\sigma^2) + \frac{\hat\sigma^2}{\sigma^2}\right]$$



Our victory is that $L_{\max}$ only depends on $\hat\sigma$, which
is an ancillary statistic for $\mu$. That is, we don't need
to know the true value of $\mu$ in order to calculate the
distribution of our g.o.f. statistic in this carefully chosen
example. In fact, a convenient form for the g.o.f.
statistic is

$$\frac{N\hat\sigma^2}{\sigma^2} = \sum_{i=1}^{N}\frac{(x_i - \hat\mu)^2}{\sigma^2}$$

which is well known to have the distribution (under
the null hypothesis) of a $\chi^2$ with $N - 1$ degrees of
freedom.
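
A quick simulation can be used to check that the distribution of this statistic does not depend on the true $\mu$. This is a sketch under assumptions of my own choosing: $\sigma = 2$, $N = 20$, the three values of $\mu$, and the helper names are hypothetical. A $\chi^2$ with $N - 1 = 19$ degrees of freedom has mean 19 and variance 38.

```python
import random
import statistics

def gof_statistic(xs, sigma):
    """N * sigma_hat^2 / sigma^2, where sigma_hat^2 is the m.l.e. (divide by N)."""
    mu_hat = sum(xs) / len(xs)
    return sum((x - mu_hat) ** 2 for x in xs) / sigma ** 2

def simulate(mu, sigma=2.0, n_events=20, n_trials=4000, seed=3):
    """Mean and variance of the statistic over Gaussian pseudo-experiments."""
    rng = random.Random(seed)
    values = [
        gof_statistic([rng.gauss(mu, sigma) for _ in range(n_events)], sigma)
        for _ in range(n_trials)
    ]
    return statistics.mean(values), statistics.variance(values)

results = {mu: simulate(mu) for mu in (-5.0, 0.0, 100.0)}
for mu, (mean, var) in results.items():
    print(mu, round(mean, 2), round(var, 2))
```

The mean and variance should come out close to 19 and 38 for every choice of $\mu$, within the statistical precision of the simulation.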

7.1. The Bad News 

Before we declare that the method performs well in 
this example, there are several ugly facts to consider: 

• Data that match the null hypothesis well yield
$N\hat\sigma^2/\sigma^2 \approx N$. Much larger or much smaller values
of the g.o.f. statistic imply poor g.o.f. This is
in contrast to Pearson's $\chi^2$ (binned $\chi^2$), for example,
where smaller $\chi^2$ is always better g.o.f.
So we must interpret this statistic differently
than how we are used to.

• The g.o.f. in this example simply reduces to a
comparison between the sample variance and $\sigma^2$.
Any distribution with variance approximately
equal to $\sigma^2$ will usually generate data that "pass
the test", even distributions that look nothing
like a Gaussian. This is the same kind of problem
that we first saw in section

• A construction similar to that of section 
will produce "impostor" p.d.f.'s that pass the 
"g.o.f." test with greater frequency than the null 
hypothesis. So, we have not eliminated the test 
bias problem. 

In this example, the g.o.f. method in question will
be able to flag some, but not all, of the data samples
that poorly match the null hypothesis. In answer to
the question "Are the data from a Gaussian with
unspecified mean, and variance equal to $\sigma^2$?", this g.o.f.
method can only answer "No" or "Maybe": it checks
the variance part of the question, but does nothing to
check the Gaussian part.
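
The "variance only" nature of the check can be illustrated with a flat impostor. The sketch below rests on assumptions that are not in the text: a uniform distribution with variance $\sigma^2 = 1$ as the impostor, 50 events per data set, and an acceptance window built from the normal approximation to the $\chi^2$ (mean $N - 1$, variance $2(N - 1)$); the name `pass_fraction` is hypothetical.

```python
import math
import random

def pass_fraction(draw, sigma=1.0, n_events=50, n_trials=4000, seed=4):
    """Fraction of data sets whose N * sigma_hat^2 / sigma^2 lies within 1.96
    standard deviations of N - 1 (normal approximation to the chi-square)."""
    rng = random.Random(seed)
    dof = n_events - 1
    half_width = 1.96 * math.sqrt(2.0 * dof)
    passed = 0
    for _ in range(n_trials):
        xs = [draw(rng) for _ in range(n_events)]
        mu_hat = sum(xs) / n_events
        stat = sum((x - mu_hat) ** 2 for x in xs) / sigma ** 2
        if abs(stat - dof) < half_width:
            passed += 1
    return passed / n_trials

half_range = math.sqrt(3.0)  # uniform on (-sqrt(3), sqrt(3)) has variance 1
gauss_rate = pass_fraction(lambda rng: rng.gauss(0.0, 1.0))
flat_rate = pass_fraction(lambda rng: rng.uniform(-half_range, half_range))
print(gauss_rate, flat_rate)
```

Under these assumptions the flat data, which look nothing like a Gaussian, pass more often than genuine Gaussian data.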

8. CONCLUSIONS 

• This "g.o.f." method is fatally flawed in the un- 
binned case. Don't use it. Complain when you 
see it used. 

• With fixed p.d.f.'s, the method suffers from test
bias, and is not invariant with respect to change
of variables. These problems persist when there
are floating parameters.

• With floating parameters, the method is often 
circular: "g.o.f." becomes a comparison between 
the measured values and the true (but unknown) 
values of the parameters. . . 

• The misbehavior of this "g.o.f." statistic is un- 
derstandable when reinterpreted as the ratio 
between the likelihood in question and a uni- 
form likelihood, and used to distinguish between 
these two specific hypotheses. Dual-hypothesis 
tests are not g.o.f. tests. 